Tidal Analysis Comparison: UTide vs PyTides¶

In this notebook, we'll look at ways of separating:

  • the storm surge (the meteorologically-driven component)
  • from the astronomical tide (the predictable gravitational component).
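Conceptually, the observed sea level is modelled as the sum of a deterministic tidal signal and a meteorological residual. A minimal synthetic sketch (the numbers are illustrative, not Roscoff data):

```python
import numpy as np

# Observed sea level = astronomical tide + meteorological residual (surge)
t = np.arange(0, 48, 0.5)                      # hours
tide = 2.7 * np.cos(2 * np.pi * t / 12.42)     # M2-like signal, 2.7 m amplitude
surge = 0.3 * np.exp(-((t - 24) / 6) ** 2)     # a 0.3 m surge event around t = 24 h
observed = tide + surge

# If the tidal component is known (e.g. from harmonic analysis),
# the surge is recovered as the residual:
residual = observed - tide
```

The harmonic-analysis libraries below estimate the tidal component from the observations themselves, which is where the two methods can differ.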

Study Location¶

We've chosen Roscoff, France as our test case - it has some of the largest tidal ranges in Europe (up to 9 meters). The data comes from the IOC database, extracted using searvey (station code rosc).

The data and tools compared¶

We'll evaluate the following libraries for the tidal (detiding) analysis:

  1. UTide - Python port of the MATLAB software
  2. pytides2 - a fork of the official repository that works with recent versions of Python
  3. FES2022 - we'll also compare results against the FES 2022 global tidal model to see how our calculated coefficients stack up against this state-of-the-art reference.

Setting Up Our Analysis Toolkit¶

First, let's import the libraries we'll need. Each serves a specific purpose in our tidal detective work:

In [1]:
import searvey
import shapely
import utide
from pytides2.tide import Tide
import pandas as pd
import hvplot.pandas
import ioc_cleanup as C

We define the list of FES constituents - the tidal components included in the FES 2022 model, representing the most important tidal harmonics globally.

The constituents are simply reordered according to their frequency band and importance.

In [2]:
UTIDE_OPTS = {
    "constit": "auto", 
    "method": "ols", 
    "order_constit": "frequency",
    "Rayleigh_min": 0.97,  # High threshold for constituent resolution
    "verbose": True
}

FES_CONSITUENTS = [
    "M2", "S2", "N2", "K2", "2N2", "L2", "T2", "R2", "NU2", "MU2", "EPS2", "LAMBDA2", # Semi-diurnal (twice daily)
    "K1", "O1", "P1", "Q1", "J1", "S1", # Diurnal (once daily)
    "MF", "MM", "MSF", "SA", "SSA", "MSQM", "MTM", # Long period (fortnightly to annual)
    "M4", "MS4", "M6", "MN4", "N4", "S4", "M8", "M3", "MKS2" # Short period (higher harmonics)
]
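The `Rayleigh_min` threshold controls which constituents UTide attempts to resolve: two constituents with frequencies f1 and f2 are separable only if the record length T satisfies T·|f1 − f2| ≥ R. A quick sketch for S2 vs K2 (periods are standard textbook values):

```python
# Rayleigh criterion: minimum record length needed to separate two constituents,
# T * |f1 - f2| >= R, with R = 0.97 as in UTIDE_OPTS above.
R = 0.97

periods_h = {"S2": 12.0, "K2": 11.96723606}  # hours
f_s2 = 1 / periods_h["S2"]                   # cycles per hour
f_k2 = 1 / periods_h["K2"]

min_length_days = R / abs(f_s2 - f_k2) / 24
print(f"S2/K2 separable with >= {min_length_days:.0f} days of data")
```

With roughly half a year of data needed just for S2/K2, the 3-year record used below is comfortably long enough for the major constituents.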

Helper functions

In [38]:
def data_availability(series: pd.Series, freq="60min") -> float:
    resampled = series.resample(freq).mean()
    data_avail_ratio = 1 - resampled.isna().sum() / len(resampled)
    return float(data_avail_ratio)


def utide_get_coefs(ts: pd.Series, lat: float, resample: int | None = None) -> dict:
    UTIDE_OPTS["lat"] = lat
    if resample is not None:
        ts = ts.resample(f"{resample}min").mean()
        ts = ts.shift(freq=f"{resample / 2}min")  # Center the resampled points
    return utide.solve(ts.index, ts, **UTIDE_OPTS)


def utide_surge(ts: pd.Series, lat: float, resample: int | None = None) -> pd.Series:
    ts0 = ts.copy()
    coef = utide_get_coefs(ts, lat, resample)
    tidal = utide.reconstruct(ts0.index, coef, verbose = UTIDE_OPTS["verbose"])
    return pd.Series(data=ts0.values - tidal.h, index = ts0.index)


def pytide_get_coefs(ts: pd.Series, resample: int = None) -> dict:
    if resample is not None:
        ts = ts.resample(f"{resample}min").mean()
        ts = ts.shift(freq=f"{resample / 2}min")  # Center the resampled points
    ts = ts.dropna()
    return Tide.decompose(ts, ts.index.to_pydatetime())[0]


def pytides_surge(ts: pd.Series, resample: int | None = None) -> pd.Series:
    ts0 = ts.copy()
    tide = pytide_get_coefs(ts, resample)
    t0 = ts.index.to_pydatetime()[0]
    hours = (ts.index - ts.index[0]).total_seconds()/3600
    times = Tide._times(t0, hours)
    return pd.Series(ts0.values - tide.at(times), index=ts0.index)


def reduce_coef_to_fes(df: pd.DataFrame):
    res = pd.DataFrame(0.0, index=FES_CONSITUENTS, columns=df.columns)
    common_constituents = df.index.intersection(FES_CONSITUENTS)
    res.loc[common_constituents] = df.loc[common_constituents]
    
    # Report what's different from FES
    not_in_fes_df = df[~df.index.isin(FES_CONSITUENTS)]
    not_in_fes = not_in_fes_df.index.tolist()
    not_in_fes_amps = not_in_fes_df["amplitude"].round(3).tolist()
    missing_fes = set(FES_CONSITUENTS) - set(df.index)
    
    # print(f"Constituents found but not in FES: {not_in_fes}")
    # print(f"Their amplitudes: {not_in_fes_amps}")
    # if missing_fes:
    #     print(f"FES constituents missing from analysis (set to 0): {sorted(missing_fes)}")
    
    return res
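The `shift(freq=...)` step in the resampling helpers compensates for pandas labelling each resampled bin by its left edge; shifting by half the bin width moves the timestamp to the bin centre. A minimal illustration:

```python
import pandas as pd

idx = pd.date_range("2024-01-01 00:00", periods=4, freq="10min")
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

binned = s.resample("20min").mean()   # bin means, labelled 00:00, 00:20
centred = binned.shift(freq="10min")  # same values, labelled 00:10, 00:30
```

Without the shift, every resampled value would be systematically half a bin early relative to the data it averages, which biases the fitted phases.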

Study site

In [4]:
station = "rosc"
sensor = "rad"

Get 25 years of data:

In [5]:
raw = searvey.fetch_ioc_station( 
    station, 
    "2000-01-01", 
    pd.Timestamp.now()
)
raw.describe()
Out[5]:
rad
count 7.669871e+06
mean 4.698489e+00
std 8.238066e+00
min -9.999990e+01
25% 3.528800e+00
50% 5.369300e+00
75% 7.094100e+00
max 9.985000e+00
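Note the minimum of about -100 in the summary above: values around -99.999 look like a missing-data sentinel rather than physical sea levels, which is one of the things the cleaning step below deals with. A minimal sketch of masking such sentinels (the -90 threshold is an assumption for illustration, not the ioc_cleanup logic):

```python
import pandas as pd

raw_vals = pd.Series([5.37, -99.999, 7.09, -99.999, 3.53])

# Mask physically implausible values (sentinel flags) before any analysis
clean_vals = raw_vals.where(raw_vals > -90.0)
```

Leaving such sentinels in would wreck both the harmonic fit and the surge residuals.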

Station Metadata

In [6]:
# Get station metadata
ioc_df = searvey.get_ioc_stations()
_lat = ioc_df[ioc_df.ioc_code == station].lat.values[0]

station_info = ioc_df[ioc_df.ioc_code == station]
print(f"Station: {station_info['location'].values[0]}")
print(f"Latitude: {_lat:.4f}°N")
print(f"Longitude: {station_info['lon'].values[0]:.4f}°E")
print(f"Country: {station_info['country'].values[0]}")
print("\nFull station details:")
station_info
Station: Roscoff
Latitude: 48.7190°N
Longitude: -3.9660°E
Country: France

Full station details:
Out[6]:
ioc_code gloss_id country location connection dcp_id last_observation_level last_observation_time delay interval ... observations_expected_per_week observations_ratio_per_week observations_arrived_per_month observations_expected_per_month observations_ratio_per_month observations_ratio_per_day sample_interval average_delay_per_day transmit_interval geometry
1268 rosc <NA> France Roscoff ftp NaN 7.97 05:34 2' 5' ... 10080.0 100 43172 43200.0 100 100 1' 3' 5' POINT (-3.966 48.719)

1 rows × 25 columns

Let's clean the data using ioc_cleanup:

In [7]:
!mkdir -p transformations
import requests
response = requests.get(f'https://raw.githubusercontent.com/seareport/ioc_cleanup/refs/heads/ioc_2024/transformations/{station}_{sensor}.json')
with open(f'transformations/{station}_{sensor}.json', 'wb') as f:
    f.write(response.content)
Out[7]:
34930

Here's a snapshot of the cleaning transformation file:

In [8]:
! head -20 "transformations/{station}_{sensor}.json" 
{
  "ioc_code": "rosc",
  "sensor": "rad",
  "notes": "",
  "skip": false,
  "wip": false,
  "start": "2022-01-01T00:00:00",
  "end": "2025-01-01T00:00:00",
  "high": null,
  "low": null,
  "dropped_date_ranges": [
    ["2022-03-27T03:00:00", "2022-03-27T03:59:00"],
    ["2022-12-02T13:35:00", "2022-12-02T13:39:00"],
    ["2023-03-26T03:00:00", "2023-03-26T03:59:00"]
  ],
  "dropped_timestamps": [
    "2022-09-01T01:57:00",
    "2023-05-01T09:03:00",
    "2023-05-01T09:04:00",
    "2023-05-01T09:05:00",

Now let's apply these transformations to clean our data:

In [9]:
# Load and apply quality control transformations
trans = C.load_transformation(station, sensor)
cleaned_data = C.transform(raw, trans)
ts = cleaned_data[sensor]

print(f"Data cleaning complete!")
print(f"Original data points: {len(raw)}")
print(f"Cleaned data points: {len(ts)}")
print(f"Data availability: {data_availability(ts):.1%}")
print(f"Time range raw: {raw.index.min()} to {raw.index.max()}")
print(f"Time range clean: {ts.index.min()} to {ts.index.max()}")
Data cleaning complete!
Original data points: 7669871
Cleaned data points: 1548102
Data availability: 99.6%
Time range raw: 2009-09-17 23:56:24 to 2025-07-11 05:29:00
Time range clean: 2022-01-01 00:01:00 to 2025-01-01 00:00:00

Round 1: UTide Analysis¶

In [10]:
out = utide_get_coefs(ts, _lat, resample=20)
print(f"Found {len(out['name'])} tidal constituents")
solve: matrix prep ... solution ... done.
Found 68 tidal constituents

Let's organize the UTide results into a clean DataFrame:

In [11]:
def utide_to_df(utide_coef: utide.utilities.Bunch) -> pd.DataFrame:
    return pd.DataFrame({ 
        "amplitude": utide_coef["A"],
        "phase": utide_coef["g"],
        "amplitude_CI": utide_coef["A_ci"],
        "phase_CI": utide_coef["g_ci"]
    }, index=utide_coef["name"])

print("Top 20 tidal constituents by amplitude (UTide):")
print(utide_to_df(out).sort_values('amplitude', ascending=False).head(20))
Top 20 tidal constituents by amplitude (UTide):
      amplitude       phase  amplitude_CI   phase_CI
M2     2.698727  141.712703      0.003081   0.065412
S2     1.015954  187.427755      0.003081   0.173769
N2     0.533240  123.751469      0.003081   0.331070
K2     0.291735  184.513174      0.003081   0.605060
MU2    0.141442  150.748488      0.003081   1.248147
M4     0.119954  140.948298      0.001153   0.550953
NU2    0.100100  111.967316      0.003081   1.763576
L2     0.096260  138.833846      0.003081   1.833901
2N2    0.088861  113.525695      0.003081   1.986781
K1     0.081898   81.700949      0.001277   0.894096
MS4    0.081814  191.957568      0.001153   0.807797
O1     0.074173  333.543720      0.001278   0.986714
SA     0.060349  304.358930      0.016419  15.386364
T2     0.054464  175.609556      0.003081   3.241401
MN4    0.047274  113.279430      0.001153   1.397993
LDA2   0.046476  110.731276      0.003081   3.798172
SSA    0.046149   79.602077      0.016350  20.206857
EPS2   0.035734  129.866658      0.003081   4.940098
MSN2   0.029038  332.005902      0.003081   6.079230
P1     0.028376   72.451468      0.001278   2.579763

Round 2: PyTides Analysis¶

In [12]:
out_pytides = pytide_get_coefs(ts, 20)
print(f"Found {len(out_pytides.model['constituent'])} tidal constituents")
/home/tomsail/work/gist/tide_best_practice/.venv/lib/python3.11/site-packages/pytides2/tide.py:438: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  heights = heights[sort]
Found 38 tidal constituents

Let's organize the PyTides results:

In [13]:
def pytides_to_df(pytides_tide: Tide)-> pd.DataFrame:
    constituent_names = [c.name.upper() for c in pytides_tide.model['constituent']]
    return pd.DataFrame(pytides_tide.model, index=constituent_names).drop('constituent', axis=1)

print("Top 20 tidal constituents by amplitude (PyTides):")
print(pytides_to_df(out_pytides).sort_values('amplitude', ascending=False).head(20))
Top 20 tidal constituents by amplitude (PyTides):
         amplitude       phase
Z0        5.366716    0.000000
M2        2.696960  141.810166
S2        1.018143  187.379524
N2        0.533421  123.834273
K2        0.292225  184.647714
MU2       0.143417  148.931139
M4        0.119931  141.134845
NU2       0.099544  112.237101
L2        0.096016  138.895029
2N2       0.083984  106.947072
K1        0.081825   81.773947
MS4       0.081581  192.032615
O1        0.074254  333.333368
SA        0.068742  222.775933
T2        0.054274  175.715530
SSA       0.050564   82.881197
MN4       0.047301  113.398872
LAMBDA2   0.046093  111.441733
2SM2      0.035839    5.226947
P1        0.028065   72.881608

Comparison¶

To fairly compare UTide and pytides results, we'll standardize them against the FES 2022 constituent list. This will show us:

  1. Which constituents each method found
  2. Which constituents are missing from each analysis
  3. How the amplitudes compare for common constituents
In [82]:
pytides_reduced_coef = reduce_coef_to_fes(pytides_to_df(out_pytides))
pytides_reduced_coef.head(10)

utide_reduced_coef = reduce_coef_to_fes(utide_to_df(out))
utide_reduced_coef.head(10)
Out[82]:
amplitude phase
M2 1.381479 304.081013
S2 0.347957 15.304028
N2 0.217599 279.213033
K2 0.104728 12.360123
2N2 0.042887 280.571507
L2 0.108715 327.051615
T2 0.019692 355.733042
R2 0.001334 211.847339
NU2 0.088104 260.693122
MU2 0.160902 26.733250
Out[82]:
amplitude phase amplitude_CI phase_CI
M2 2.698727 141.712703 0.003081 0.065412
S2 1.015954 187.427755 0.003081 0.173769
N2 0.533240 123.751469 0.003081 0.331070
K2 0.291735 184.513174 0.003081 0.605060
2N2 0.088861 113.525695 0.003081 1.986781
L2 0.096260 138.833846 0.003081 1.833901
T2 0.054464 175.609556 0.003081 3.241401
R2 0.006671 223.102025 0.003081 26.463453
NU2 0.100100 111.967316 0.003081 1.763576
MU2 0.141442 150.748488 0.003081 1.248147

Visual comparison¶

What to look for:

  • Major constituents (M2, S2, N2, K1, O1) should have similar amplitudes
  • Minor constituents may show more variation between methods
  • Missing constituents appear as zero amplitude in one method but not the other
In [83]:
def concat_utide_pytides(pytides_df, utide_df):
    multi_df = pd.concat({"pytides": pytides_df, "utide": utide_df})
    multi_df.index.names = ['method', 'constituent']
    multi_df = multi_df.swaplevel().sort_index()

    available_constituents = multi_df.index.get_level_values('constituent').unique()
    filtered_order = [c for c in FES_CONSITUENTS if c in available_constituents][::-1]
    return  multi_df.reindex(filtered_order, level='constituent')

multi_df_ordered = concat_utide_pytides(pytides_reduced_coef, utide_reduced_coef)
multi_df_ordered
Out[83]:
amplitude phase amplitude_CI phase_CI
constituent method
MKS2 pytides 0.000000 0.000000 NaN NaN
utide 0.013785 234.094592 0.003081 12.806398
M3 pytides 0.002746 92.236769 NaN NaN
utide 0.014113 86.700040 0.000382 1.548933
M8 pytides 0.013484 48.585303 NaN NaN
... ... ... ... ... ...
N2 utide 0.533240 123.751469 0.003081 0.331070
S2 pytides 0.347957 15.304028 NaN NaN
utide 1.015954 187.427755 0.003081 0.173769
M2 pytides 1.381479 304.081013 NaN NaN
utide 2.698727 141.712703 0.003081 0.065412

68 rows × 4 columns

In [16]:
# Create the comparison plot

def plot_comparative_amplitudes(df):
    return df.amplitude.hvplot.barh(
        ylabel="Tidal Constituent",
        xlabel="Amplitude (meters)", 
        by="method", 
        grid=True,
        title=f"Tidal Amplitudes: UTide vs PyTide, station {station}",
        legend='top_right',
        rot=90
    ).opts(
        height=1000, 
        width=1000,
        fontsize={'title': 15, 'labels': 12, 'xticks': 8, 'yticks': 8}
    )

plot_comparative_amplitudes(multi_df_ordered)
Out[16]:

Quantitative comparison¶

We'll assess the residual sum of squares (RSS) over all constituents, taking pytides as the reference.

The RSS is given by:

$$ \operatorname{RSS} = \sum_{i=1}^{n} \left(A_{pytides,i} - A_{utide,i}\right)^2 $$
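With the amplitudes from the two tables above (M2, S2, N2 at Roscoff), the computation looks like:

```python
import numpy as np

# Amplitudes (m) for M2, S2, N2 from the pytides and UTide analyses above
a_pytides = np.array([2.696960, 1.018143, 0.533421])
a_utide   = np.array([2.698727, 1.015954, 0.533240])

# Sum of squared amplitude differences
rss = np.sum((a_pytides - a_utide) ** 2)
print(f"RSS over these three constituents: {rss:.2e}")
```

Because the differences are a few millimetres at most, the RSS is tiny compared to the metre-scale amplitudes involved.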

In [17]:
def compute_rss(df): 
    amp_pytides = df.xs('pytides', level='method')['amplitude']
    amp_utide = df.xs('utide', level='method')['amplitude']
    # Ensure both Series are aligned by index
    amp_pytides, amp_utide = amp_pytides.align(amp_utide, join='inner')
    # Compute RSS
    return ((amp_pytides - amp_utide) ** 2).sum()

print(f"rss for {station} is {compute_rss(multi_df_ordered):.3f}")
rss for rosc is 0.004

We'll now iterate through an existing folder containing clean data at tide gauge locations.

In [ ]:
DATA_FOLDER = "data"
UTIDE_OPTS["verbose"] = False
import glob
import os 


res = {}
for path in glob.glob(os.path.join(DATA_FOLDER, "*.parquet")): 
    ts = pd.read_parquet(path)
    ts = ts[ts.columns[0]]
    root, file_ext = os.path.split(path)
    file, ext = os.path.splitext(file_ext)
    station, sensor = file.split("_")
    _lon = ioc_df[ioc_df.ioc_code == station].lon.values[0]
    _lat = ioc_df[ioc_df.ioc_code == station].lat.values[0]
    try: 
        ut = utide_get_coefs(ts, _lat, resample=20)
        utide_reduced_coef = reduce_coef_to_fes(utide_to_df(ut))
        pt = pytide_get_coefs(ts, 20)
        pytides_reduced_coef = reduce_coef_to_fes(pytides_to_df(pt))
        multi_df_ordered = concat_utide_pytides(pytides_reduced_coef, utide_reduced_coef)
        rss = compute_rss(multi_df_ordered)
        res[station] = {
            "ioc_code": station,
            "lat": _lat,
            "lon": _lon,
            "rss": rss
        }
        print(f"rss for {station} is {rss:.4f}")
    except Exception as e:
        print(f"couldn't process {station}: {e}")
In [70]:
rss_df = pd.DataFrame(res).T

rss_df.rss = rss_df.rss.astype(float)
rss_df.hvplot.points(
    x="lon",
    y="lat",
    c="rss",
    hover_cols = ['ioc_code'],
    s=100,
    geo = True,
    tiles = True
).opts(width = 1000, height = 800, title = "RSS difference between UTide and pytides constituents")
Out[70]:

Plot the tidal amplitudes for another station:

In [75]:
station = "delf"
sensor = "flt"
ioc_df = searvey.get_ioc_stations()

rsp = 20

_lat = ioc_df[ioc_df.ioc_code == station].lat.values[0]
ts = pd.read_parquet(f"data/{station}_{sensor}.parquet")[sensor]
out_pytides = pytide_get_coefs(ts, rsp)
pytides_reduced_coef = reduce_coef_to_fes(pytides_to_df(out_pytides))
out_utides = utide_get_coefs(ts, _lat, resample=rsp)
utide_reduced_coef = reduce_coef_to_fes(utide_to_df(out_utides))
multi_df_ordered = concat_utide_pytides(pytides_reduced_coef, utide_reduced_coef)
compute_rss(multi_df_ordered)
plot_comparative_amplitudes(multi_df_ordered)
/home/tomsail/work/gist/tide_best_practice/.venv/lib/python3.11/site-packages/pytides2/tide.py:438: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  heights = heights[sort]
Out[75]:
np.float64(0.004367605911900469)
Out[75]:

Comparison between tidal residuals¶

If both methods work correctly, the tidal residual time series - which correspond to the meteorological component - should be very similar.

Significant differences would indicate problems with UTide, pytides, or both.

In [76]:
print("Calculating storm surge using both methods...")
rsp = 30
surge_pytides = pytides_surge(ts, resample=rsp)
surge_utide = utide_surge(ts, _lat, resample=rsp)

correlation = surge_pytides.corr(surge_utide)
rmse = ((surge_pytides - surge_utide)**2).mean()**0.5

print(f"--------\n📊 Storm Surge Comparison Results:")
print(f"Correlation coefficient: {correlation:.4f}")
print(f"RMSE between methods: {rmse:.3f} meters")
Calculating storm surge using both methods...
/home/tomsail/work/gist/tide_best_practice/.venv/lib/python3.11/site-packages/pytides2/tide.py:438: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  heights = heights[sort]
--------
📊 Storm Surge Comparison Results:
Correlation coefficient: 0.9540
RMSE between methods: 0.096 meters
In [77]:
(surge_pytides.resample("1h").mean().hvplot(label="surge pytides", grid=True)
 *surge_utide.resample("1h").mean().hvplot(label="surge utide")
 ).opts(
    width=1200,
    height = 500
)
Out[77]:

Second part: chunked detiding (to be continued)¶

In [78]:
def surge_chunked(ts: pd.Series, lat: float, resample: int = None, max_days: int = 365) -> pd.Series:
    ts0 = ts.copy()
    if resample is not None:
        ts = ts.resample(f"{resample}min").mean()
        ts = ts.shift(freq=f"{resample / 2}min")

    OPTS = {
        "constit": "auto",
        "method": "ols",
        "order_constit": "frequency",
        "Rayleigh_min": 0.97,
        "lat": lat,
        "verbose": True
    }

    detided = pd.Series(index=ts0.index, dtype='float64')

    t_start = ts.index.min()
    t_end = ts.index.max()
    chunk_start = t_start
    chunk_size = pd.Timedelta(days = max_days)

    while chunk_start < t_end:
        current_chunk_size = chunk_size

        while True:
            chunk_end = chunk_start + current_chunk_size
            if chunk_end > t_end:
                chunk_end = t_end

            chunk = ts[chunk_start:chunk_end]
            avail = data_availability(chunk, freq="60min")
            total_days = current_chunk_size.total_seconds()/(3600*24)
            if total_days * avail >= 365 * 0.9:  # require ~0.9 years of valid data to resolve annual constituents
                print(f"Detiding chunk {chunk_start.date()} to {chunk_end.date()} ({avail*100:.1f}% available)")
                try:
                    coef = utide.solve(
                        chunk.index,
                        chunk,
                        **OPTS
                    )
                    recon_index = ts0.loc[chunk_start:chunk_end].index
                    tidal = utide.reconstruct(recon_index, coef, verbose=OPTS["verbose"])
                    detided.loc[chunk_start:chunk_end] = ts0.loc[chunk_start:chunk_end].values - tidal.h
                except Exception as e:
                    print(f"UTide failed on chunk {chunk_start} to {chunk_end}: {e}")
                break
            else:
                print(f"Data availability {avail:.1%} from {chunk_start.date()} to {chunk_end.date()} - expanding chunk.")
                current_chunk_size += pd.Timedelta(days=6*30)
                if chunk_start + current_chunk_size > t_end:
                    print("End of time series reached with insufficient data.")
                    break

        chunk_start = chunk_end

    return detided
In [79]:
chunked = surge_chunked(ts, _lat, 20)
Detiding chunk 2022-01-01 to 2023-01-01 (98.8% available)
solve: matrix prep ... solution ... done.
prep/calcs ... done.
Detiding chunk 2023-01-01 to 2024-01-01 (92.0% available)
solve: matrix prep ... solution ... done.
prep/calcs ... done.
Detiding chunk 2024-01-01 to 2024-12-31 (90.4% available)
solve: matrix prep ... solution ... done.
prep/calcs ... done.
In [80]:
(surge_pytides.resample("1h").mean().hvplot(
    label="surge pytides", grid=True
 )*surge_utide.resample("1h").mean().hvplot(
    label="surge utide"
 )*chunked.resample("1h").mean().hvplot(
    label="chunked utide"
 )).opts(
    width=1200,
    height = 500
)
Out[80]: